import pandas as pd
import numpy as np
airbnb = pd.read_csv('https://raw.githubusercontent.com/ishaandey/node/master/week-4/workshop/airbnb.csv') # For Seattle only
As with all new datasets, let's start by familiarizing ourselves with the dataset:
Try it! Print the shape, columns, and show a sample observation
print(airbnb.shape)
print(airbnb.columns.values)
airbnb.sample()
(7237, 13)
['name' 'host_id' 'neighbourhood_group' 'neighbourhood' 'latitude' 'longitude' 'room_type' 'price' 'minimum_nights' 'number_of_reviews' 'reviews_per_month' 'calculated_host_listings_count' 'availability_365']
| | name | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1478 | Alki Beach Cottage on the Water | 30365082 | West Seattle | Alki | 47.57894 | -122.41211 | Entire home/apt | 120 | 3 | 84 | 1.63 | 4 | 359 |
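Beyond shape and columns, a couple more one-liners can help when getting to know a dataset. The sketch below uses a small made-up DataFrame (hypothetical values, not the real airbnb data) to show column types, missing-value counts, and summary statistics; the same calls should work on airbnb itself.

```python
import pandas as pd

# Hypothetical three-row stand-in for the airbnb DataFrame (values are made up).
listings = pd.DataFrame({
    "name": ["Cozy studio", "Lake view loft", None],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt"],
    "price": [65, 180, 120],
    "reviews_per_month": [1.2, None, 0.4],
})

# Column dtypes: which features are numeric and which are text.
print(listings.dtypes)

# Missing values per column, useful to know before plotting or aggregating.
missing = listings.isna().sum()
print(missing)

# Summary statistics, computed for the numeric columns only.
summary = listings.describe()
print(summary)
```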
Some imports: Note that we'll rename plotly express as px.
plotly express is a "wrapper" for the base plotly package. What that means is we can use incredibly easy and readable functions, and plotly express will do the hard work of converting that input into formats that the software can understand.
Quick aside: If you're a web developer and love JS, or an academic and use R, the same Plotly library is available to use in both languages.
import plotly
import plotly.express as px
Let's start off with a simple scatter plot, which we can whip up with px.scatter()
What does the association between price and availability look like?
airbnb_sample = airbnb.tail(1500)
fig = px.scatter(airbnb_sample, x='availability_365', y='price')
fig.show()
It works, but doesn't really tell us too much. Let's modify the plot by adding parameters to px.scatter()
With any python package, we can pull up some quick documentation from Jupyter itself using ?
Try it! What parameters does px.scatter accept?
px.scatter?
fig = px.scatter(airbnb_sample, x='availability_365', y='price',
opacity=0.3, marginal_y = 'histogram',
color='room_type',
)
fig.show()
So we're still not seeing much of a clear trend here.
There are, however, quite a few outliers in the price. Let's see if we can adjust our graph so the rest of the data isn't squished down.
fig = px.scatter(airbnb_sample, x='availability_365', y='price',
opacity=0.3, range_y = (1,1050),
marginal_y = 'histogram', marginal_x = 'histogram',
color='room_type',
)
fig.show()
Peep the histogram on the right, that shows a pretty neat trend with the room types. We can check that out in more depth later.
Those outliers were causing us a bit of trouble, but they weren't too hard to deal with.
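Picking the axis limit by eye works fine here, but it can also be derived from the data. The sketch below, on a made-up price Series, shows two options: using a quantile as the upper axis limit (the idea behind range_y), or dropping the extreme rows before plotting. The 0.90 cutoff is an arbitrary assumption, not a recommendation.

```python
import pandas as pd

# Hypothetical prices with one extreme listing (values are made up).
prices = pd.Series([60, 75, 90, 110, 130, 150, 5000], name="price")

# Option 1: keep every row and only choose an axis limit, as range_y does.
upper = prices.quantile(0.90)
print("suggested y-axis upper limit:", upper)

# Option 2: drop the extreme rows before plotting.
trimmed = prices[prices <= upper]
print(len(prices) - len(trimmed), "listing(s) removed")
```

Option 1 keeps the outliers in the data (they are just off-screen), while option 2 changes what any marginal histogram is computed from, so it's worth being deliberate about which one you want.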
But that does make me a bit curious: What was so special about those listings?
The power of plotly is that we can use the interactivity to literally just hover over the data points to see what's going on.
All we have to do is suggest what features to display:
See if you can find out which parameters can be used to show text on hover:
fig = px.scatter(airbnb_sample, x='availability_365', y='price',
opacity=0.3, color='room_type', log_y=True,
hover_name='name', hover_data=['neighbourhood_group', 'number_of_reviews']
)
fig.show()
Plotly is interactive! Play around with the legends and plot area.
Double-click an entry in the legend on the right, and plotly will automatically update the figure to show only that group's points.
We can change our colors fairly easily using color scales.
If the feature we pass to color= is discrete or categorical, we'll add the color_discrete_sequence param
If the feature is instead continuous, we'll use the color_continuous_scale param instead
Open the docs, and try out your favorite below:
fig = px.scatter(airbnb_sample, x='availability_365', y='price',
opacity=0.3, color='room_type', log_y=True,
hover_name='name', hover_data=['neighbourhood_group', 'number_of_reviews'],
color_discrete_sequence=plotly.colors.qualitative.Prism,
)
fig.show()
Under the hood, we can see that each of these sequences is just a list of colors, so we could subset them to use different values
plotly.colors.qualitative.Prism[2:6]
['rgb(56, 166, 165)', 'rgb(15, 133, 84)', 'rgb(115, 175, 72)', 'rgb(237, 173, 8)']
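Because these are plain Python lists, we can also pair colors with categories ourselves. The sketch below builds a dict from the four colors printed above; the room type labels are an assumption (check airbnb.room_type.unique() for the real ones).

```python
# The four colors printed above, copied verbatim.
prism_slice = ['rgb(56, 166, 165)', 'rgb(15, 133, 84)',
               'rgb(115, 175, 72)', 'rgb(237, 173, 8)']

# Assumed category labels; the real dataset may use different ones.
room_types = ['Entire home/apt', 'Private room', 'Shared room']

# zip stops at the shorter list, so the fourth color is simply left unused.
color_map = dict(zip(room_types, prism_slice))
print(color_map)
```

A dict like this could be passed as color_discrete_map=color_map to px.scatter, which pins each category to a fixed color instead of assigning colors in order of appearance.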
To finish off, we can add titles, labels and such pretty easily.
See if you can use the function documentation or google to figure out how to do that:
fig = px.scatter(airbnb_sample, x='availability_365', y='price',
opacity=0.3, color='room_type', log_y=True,
hover_name='name', hover_data=['neighbourhood_group', 'number_of_reviews'],
color_discrete_sequence=plotly.colors.sequential.BuPu_r[3:],
title='Seattle Airbnb Prices vs. Demand, Broken Down by Room Type',
labels={'availability_365':'Days Available Per Year',
'price':'Nightly Rate ($)',
'room_type':'Type of Room'},
)
fig.show()
Oftentimes we'll want to create visualizations at some aggregate level.
For example, let's say we want to show neighborhoods with a high median rental price.
Our data is at a per-listing level, meaning that each individual row is its own listing, with its price.
To get data at the per-neighborhood level, we've got to roll up all the listing prices per neighborhood, in other words, group the data by neighborhood, then find the median for all those listings.
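# numeric_only=True skips text columns such as name and room_type, which have no median
airbnb_byN = airbnb.groupby(by=['neighbourhood','neighbourhood_group']).median(numeric_only=True).reset_index()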
airbnb_byN.head(3)
| | neighbourhood | neighbourhood_group | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Adams | Ballard | 25139423.0 | 47.671920 | -122.38498 | 107.0 | 2.0 | 31.0 | 1.90 | 1.0 | 113.0 |
| 1 | Alki | West Seattle | 48206652.5 | 47.575725 | -122.40784 | 112.5 | 2.0 | 34.5 | 2.02 | 1.0 | 113.5 |
| 2 | Arbor Heights | West Seattle | 55322823.5 | 47.511915 | -122.37911 | 109.5 | 2.0 | 29.5 | 1.70 | 1.0 | 143.0 |
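To see the roll-up on a scale small enough to check by hand, here is a sketch on five made-up listings (hypothetical values). It also counts how many listings feed into each median, since a median over two or three listings is much shakier than one over a few hundred.

```python
import pandas as pd

# Hypothetical per-listing rows (values are made up).
listings = pd.DataFrame({
    "neighbourhood": ["Alki", "Alki", "Alki", "Adams", "Adams"],
    "price": [100, 120, 300, 90, 110],
})

# Roll up to one row per neighbourhood: median price plus the number of listings behind it.
by_hood = (listings.groupby("neighbourhood")["price"]
           .agg(median_price="median", n_listings="count")
           .reset_index())
print(by_hood)
```

Notice that the single 300 listing in Alki doesn't drag the median up the way it would drag a mean, which is one reason to prefer the median for skewed prices.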
Now, let's drop the columns that make no sense to have a median of.
airbnb_byN = airbnb_byN.drop(columns=['host_id','latitude', 'longitude'])
airbnb_byN.head(3)
| | neighbourhood | neighbourhood_group | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|
| 0 | Adams | Ballard | 107.0 | 2.0 | 31.0 | 1.90 | 1.0 | 113.0 |
| 1 | Alki | West Seattle | 112.5 | 2.0 | 34.5 | 2.02 | 1.0 | 113.5 |
| 2 | Arbor Heights | West Seattle | 109.5 | 2.0 | 29.5 | 1.70 | 1.0 | 143.0 |
In breakout groups, see if you can build a bar plot to show median prices in each neighbourhood group, and sort the bars in a meaningful way.
Make it complete! Label axes, hover text, color, the whole nine yards.
fig = px.bar(airbnb_byN.sort_values(by='neighbourhood_group'),
x='neighbourhood', y='price',
log_y=False,
color='neighbourhood_group',
hover_name='neighbourhood_group',
color_discrete_sequence=plotly.colors.qualitative.Prism,
title='Seattle Airbnb Prices across Neighborhoods',
labels={'availability_365':'Days Available Per Year',
'price':'Median Nightly Rate ($)',
'neighbourhood':'Neighborhood',
'neighbourhood_group':'Region'
},
)
fig.show()
Say my friend and I have a budget of $85 per night. Show which regions are ideal for this; how you do that is entirely up to you. Draw a horizontal line or color the ideal regions' bars differently, as long as it communicates which neighborhoods are generally cheaper.
Hint: To draw a horizontal line, use fig.add_hline() with corresponding parameters
Hint: To color bars according to some condition, first create a new column that describes if the value is below budget.
budget = 85
airbnb_byN['budget'] = (airbnb_byN.price <= budget).map({True:'Under Budget', False:'Over Budget'})
fig = px.bar(airbnb_byN.sort_values(by='neighbourhood_group'),
x='neighbourhood', y='price',
log_y=False,
color='budget',
hover_name='neighbourhood_group',
color_discrete_sequence=['rgb(120,113,118)', 'rgb(191,65,67)'],
title='Seattle Airbnb Prices across Neighborhoods',
labels={'availability_365':'Days Available Per Year',
'price':'Median Nightly Rate ($)',
'neighbourhood':'Neighborhood',
'neighbourhood_group':'Region',
'budget':'Budget of ${}'.format(budget)
},
)
fig.add_hline(y=budget, line_width=2, line_dash="dash", line_color="black")
fig.show()
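The comparison-plus-.map() pattern is one way to build the label column; numpy's np.where is a common alternative. The sketch below applies it to a few made-up median prices (hypothetical values and neighbourhood names), reusing the same threshold.

```python
import numpy as np
import pandas as pd

budget = 85  # same threshold as in the cell above

# Hypothetical median prices per neighbourhood (values are made up).
medians = pd.DataFrame({
    "neighbourhood": ["A", "B", "C", "D"],
    "price": [70.0, 85.0, 99.5, 140.0],
})

# np.where picks the first label where the condition holds, the second otherwise.
medians["budget"] = np.where(medians["price"] <= budget, "Under Budget", "Over Budget")

# Quick sanity check on how many neighbourhoods land in each bucket.
print(medians["budget"].value_counts())
```

One thing to watch: color_discrete_sequence hands out colors in the order categories appear, so which label gets which color depends on the row order. Passing a color_discrete_map dict instead ties each label to a fixed color.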
There are quite a few different ways to show geographical data, usually with choropleth charts or scatter plots.
Our friend Plotly has them all: https://plotly.com/python/maps/
A quick note about how this works before letting you leaf through the docs page.
Most of the params in px.scatter_mapbox() behave pretty similarly to px.scatter, except that we provide latitude and longitude data instead of x and y. Luckily, our dataset already has that included, but oftentimes we'll have to find a lookup table online to convert city names, for example, to lat / lon coordinates.
We don't necessarily have to provide a value to size=, but that usually can help highlight points of interest.
zoom= on the other hand, just changes how zoomed in the initial picture is when first loaded.
Finally, we'll have to set the figure's mapbox_style= to the specific base map we want to load.
For more information on what options are available here, check out https://plotly.com/python/mapbox-layers/
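On the lookup-table point: when a dataset only has place names, the coordinates can be attached with a merge. The sketch below uses two made-up tables; the table names are hypothetical and the coordinates are rough city-center approximations, just to show the mechanics.

```python
import pandas as pd

# Hypothetical records that only carry a city name.
visits = pd.DataFrame({"city": ["Seattle", "Tacoma", "Seattle"],
                       "price": [120, 80, 95]})

# Hypothetical lookup table; coordinates are approximate city centers.
coords = pd.DataFrame({
    "city": ["Seattle", "Tacoma"],
    "latitude": [47.61, 47.25],
    "longitude": [-122.33, -122.44],
})

# A left join keeps every visit and attaches coordinates where the city is known.
located = visits.merge(coords, on="city", how="left")
print(located)
```

With how="left", any city missing from the lookup table would end up with NaN coordinates rather than being silently dropped, which makes gaps easier to spot.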
airbnb['budget'] = (airbnb.price <= budget).map({True:'Under Budget', False:'Over Budget'})
fig = px.scatter_mapbox(airbnb, lat='latitude', lon='longitude',
color='neighbourhood_group', size='price', opacity=.6,
hover_name='name',hover_data=['neighbourhood','budget'],
color_discrete_sequence=plotly.colors.qualitative.Prism_r,
zoom=10, labels={'availability_365':'Days Available Per Year',
'price':'Nightly Rate ($)',
'neighbourhood':'Neighborhood',
'neighbourhood_group':'Neighborhood Region',
'budget':'Budget of ${}'.format(budget),
'latitude':'Lat', 'longitude':'Lon'
},
)
fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
Show places in Downtown, Central Area, and Capitol Hill, and highlight those under budget
fig = px.scatter_mapbox(airbnb[airbnb.neighbourhood_group.isin(['Downtown', 'Central Area','Capitol Hill'])],
lat='latitude', lon='longitude',
color='budget', opacity=.6,
hover_name='name',hover_data=['price','neighbourhood','budget'],
color_discrete_map={'Over Budget':'rgb(204,204,204)', 'Under Budget':'rgb(191,65,67)'},
zoom=12, labels={'availability_365':'Days Available Per Year',
'price':'Nightly Rate ($)',
'neighbourhood':'Neighborhood',
'neighbourhood_group':'Neighborhood Region',
'budget':'Budget of ${}'.format(budget),
'latitude':'Lat', 'longitude':'Lon'
},
)
fig.update_layout(mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
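The map shows where the under-budget listings are; a small table can back it up with numbers. As a sketch on made-up listings (hypothetical values), the same isin() filter can be followed by a groupby to get the share of under-budget listings per region.

```python
import pandas as pd

budget = 85  # same threshold as earlier

# Hypothetical listings (values are made up).
listings = pd.DataFrame({
    "neighbourhood_group": ["Downtown", "Downtown", "Capitol Hill", "Ballard", "Central Area"],
    "price": [150, 80, 85, 60, 200],
})

# Keep only the three regions of interest, as in the map above.
regions = ["Downtown", "Central Area", "Capitol Hill"]
subset = listings[listings["neighbourhood_group"].isin(regions)]

# The mean of a True/False column is the fraction of True values.
share_under = (subset["price"] <= budget).groupby(subset["neighbourhood_group"]).mean()
print(share_under)
```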